PROJECT REPORT
Infrastructure for Large-scale Data Resource Sharing & An Environmental Case Study

Abstract

Large-scale data sharing is nontrivial due to the large scale, autonomy, and heterogeneity of data sources. This project represents our initial effort on this problem. The main goal of our project is to identify requirements, system architectures, and key technology barriers in establishing an ICT infrastructure to support large-scale data resource sharing between research institutions. To achieve this goal, in this report we first investigate a few main issues that are important for large-scale data sharing, such as interoperability, extensibility, and scalability, and meanwhile highlight the need for synergies among several technologies, namely Grid, Peer-to-Peer (P2P), and data integration technologies. We then propose a novel service-oriented architecture designed specifically for large-scale data sharing. Finally, we present an environmental case study and report our experiences.

1 Motivation, Scope, and Goals

Modern, large-scale data sharing is typically characterized by the large volumes of data involved and the heterogeneity of the data sources accessed [23]. For example, in order to answer complex biological questions, biologists have to access and analyze large quantities of biological data stored in widely distributed repositories. These repositories, each making its own decisions about data storage and retrieval, are highly autonomous and heterogeneous. For instance, they may describe the same data objects using different representations, e.g., protein sequences in SWISS-PROT and structures in the Protein Data Bank (PDB).

To share data under such circumstances, one possible approach is data replication: data to be shared are first replicated to local repositories or a central repository before any processing (e.g., data mapping or transformation). Though simple, this approach suffers from apparent limitations such as unnecessary bandwidth cost and high maintenance cost; moreover, it may sometimes be infeasible for privacy reasons. Data integration, on the other hand, avoids these limitations by allowing the flexible and managed federation, exploration, and processing of data from distributed sources [9]. Over the last decade, much effort has been put into data integration by various communities. However, to this day, supporting data sharing on a large scale remains a great challenge.

To gain more insight into the problem, and thus facilitate the development of sophisticated techniques for large-scale data sharing, we focus our study on the Environmental Science area. We chose Environmental Science as our focus area based on the following considerations: (1) it exemplifies this problem with non-trivial yet not overly complex data and models; (2) it is an area with a clear need for national and international collaboration, which has not yet been achieved due to issues many of which modern information and communications technology is well positioned to address; (3) it is itself an important area, and our results can be applied to generate immediate and significant benefits; and (4) the Queensland EPA is a committed partner who provides large and complex real data, spatial models, an operational environment, user requirements, and domain expertise. Our research, however, is not limited to the environmental sciences, but aims at supporting all data-intensive scientific research.
The main goal of our project is to identify requirements, system architectures, and key technology barriers in establishing an ICT infrastructure to support large-scale data resource sharing between research institutions. Specifically, we hope to achieve the following:

– insights into the important issues involved in large-scale data sharing;
– insights into the key technologies (e.g., their strengths and weaknesses) and their roles in large-scale data sharing;
– the design of a large-scale, general-purpose infrastructure to support data-intensive applications;
– a working prototype to support data sharing among a selected collection of geospatial data sources (centred around the WildNet database from the EPA).

To achieve the above, in this report we first investigate a few main issues that are important for large-scale data sharing, such as interoperability, extensibility, and scalability, and meanwhile highlight the need for synergies among Grid, P2P, and data integration technologies: by combining Grid and data integration technologies, we facilitate interoperability among heterogeneous data sources; by integrating P2P technologies into both, we improve the extensibility and scalability of data sharing. We then propose a novel service-oriented architecture based on these technologies, designed specifically for large-scale data sharing. Finally, we give an environmental case study and report our experiences.

The rest of the report is organized as follows: Section 2 describes the state of the art of large-scale data sharing, including important technologies, their roles, and recent efforts; Section 3 investigates the main issues involved in large-scale data sharing and highlights the necessity of technology integration; Section 4 presents the proposed architecture; Section 5 gives an environmental case study; and Section 6 summarizes what we have achieved and points out future work.

2 The State of the Art

In this section, we review the state of the art of large-scale data sharing, including the important technologies (i.e., Grid, P2P, and data integration technologies) and recent efforts in integrating these technologies for large-scale data sharing.

2.1 Grid Technologies

Grid technologies and infrastructures aim at supporting "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" [12]. The Open Grid Services Architecture (OGSA) [10] is designed to facilitate interoperability among different Grid deployments; it aligns Grid technologies with Web Services technologies and introduces a service-oriented paradigm into the Grid. The first formal and technical specification of OGSA is the Open Grid Services Infrastructure (OGSI) [32], which has several implementations such as Globus Toolkit 3.0 (GT3) [14]. Currently, OGSI is evolving towards the Web Services Resource Framework (WSRF) [38] to embrace new Web Services standards.

OGSA and OGSI

OGSA adopts a common representation for all resources (e.g., computational and storage resources, programs, databases): each resource in OGSA is represented as a Grid service, i.e., a Web service that provides a set of well-defined interfaces and follows specific conventions [11]. OGSI further specifies the basic interfaces (or portTypes) to be implemented by Grid services, such as GridService (GS), Factory, ServiceGroup, and so on. Most of these interfaces are optional, except GridService, which is mandatory and must be implemented by all Grid services. Depending on which interfaces are implemented, Grid services with different functionalities result. Grid services can be instantiated dynamically. Each instantiation of a Grid service generates a Grid service instance, identified by a Grid Service Handle (GSH) and a Grid Service Reference (GSR); the difference between them is that the GSH is invariant, while the GSR is stateful and can change over the lifetime of a service instance. To create a service instance, a Grid service called a 'factory' is invoked, which implements the Factory interface (services and factories are located by another Grid service, called a 'registry', which implements the ServiceGroup (SGR) portType).
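To make the portType pattern concrete, the following is a minimal Java sketch of the interfaces just described. All names and signatures here are illustrative assumptions made for this report, not the actual OGSI specification or the GT3 Java bindings.

    import java.util.List;

    // Illustrative only: hypothetical Java mirrors of the OGSI portTypes
    // described above, not the real OGSI/GT3 interfaces.
    final class GridServiceHandle {   // invariant GSH identifying an instance
        final String uri;
        GridServiceHandle(String uri) { this.uri = uri; }
    }

    final class GridServiceReference { // stateful GSR; may change over the instance's lifetime
        final String wsdl;
        GridServiceReference(String wsdl) { this.wsdl = wsdl; }
    }

    interface GridService {                        // the one mandatory portType
        String findServiceData(String queryExpr);  // introspect the instance's service data
        void destroy();                            // explicit lifetime management
    }

    interface Factory extends GridService {
        // Instantiates a new Grid service instance and returns its invariant GSH;
        // the current GSR is resolved from the GSH by a separate mechanism.
        GridServiceHandle createService(String creationParameters);
    }

    interface ServiceGroup extends GridService {   // the 'registry'
        List<GridServiceHandle> findFactories(String criteria); // locate factories/services
    }

A client would thus locate a factory through the registry, invoke createService to obtain a GSH, and then interact with the new instance through the GSR resolved from that handle.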
OGSA-DAI/DQP

Both OGSA-DAI and OGSA-DQP build upon OGSA. Their main objective is to provide a uniform service interface for data access and integration over Grids. OGSA-DAI extends Grid services with new services and portTypes for individual data access, such as the Grid Data Service (GDS), Grid Data Transport (GDT), Grid Data Service Factory (GDSF), and DAI Service Group Registry (DAISGR). The Grid Data Service is the primary OGSA-DAI service; it supports data access through the GDS portType (via the perform operation) and data delivery through the GDT portType. GDS instances are created by invoking a GDSF, which can be located through the DAISGR. OGSA-DQP extends OGSA-DAI with two new services (and their corresponding factories) for distributed query processing over multiple data sources: the Grid Distributed Query Service (GDQS), which compiles, optimises, partitions, and schedules distributed query execution plans over multiple execution nodes in the Grid, and the Grid Query Evaluation Service (GQES), which is in charge of a partition of the query execution plan assigned to it by a GDQS. In the GDQS, a Grid Distributed Query (GDQ) portType is added for importing source schemas. OGSA-DQP itself does not perform any schema mediation.

2.2 P2P Technologies

P2P technologies share the same final objective as Grid technologies, i.e., to pool large sets of resources; however, they address different requirements and thus take different design approaches. In general, P2P technologies focus more on decentralization and scalability, while Grid technologies focus more on providing various complex services. Three main classes of P2P systems have emerged so far: distributed computing, file sharing, and collaborative systems, among which file sharing systems are the most studied. Based on whether there are constraints on network topology or data placement, file sharing systems are further classified into two main kinds: unstructured (e.g., [15]) and structured (e.g., [30]). Our work is most related to super-peer networks [39], a kind of unstructured network that strikes a balance between the inherent search efficiency of centralized systems and the robustness of decentralized systems. This kind of network can also take advantage of the heterogeneity in the capabilities of participating peers. In a super-peer network, some peers with more capability (e.g., more bandwidth or CPU) take on the role of super-peers and act as servers to a set of clients (peers with less capability) in the network. A good survey of P2P technologies appears in [24].
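As an illustration of the super-peer organization just described, the following is a minimal, self-contained Java sketch. All names, and the flooding-with-TTL search, are simplifying assumptions for exposition, not a description of any particular deployed system: each super-peer indexes its clients' content and forwards unanswered queries to neighbouring super-peers.

    import java.util.*;

    // Illustrative super-peer sketch: super-peers index their clients' content
    // and flood queries to neighbouring super-peers with a bounded TTL.
    final class Peer {
        final String id;
        final Set<String> content = new HashSet<>(); // items this (client) peer holds
        Peer(String id) { this.id = id; }
    }

    final class SuperPeer {
        private final Map<String, Set<Peer>> index = new HashMap<>(); // item -> client peers holding it
        private final List<SuperPeer> neighbours = new ArrayList<>();

        void register(Peer client) {               // clients publish their content on joining
            for (String item : client.content)
                index.computeIfAbsent(item, k -> new HashSet<>()).add(client);
        }

        void connect(SuperPeer other) { neighbours.add(other); }

        // Answer from the local index, then forward to neighbours while the TTL
        // lasts (no duplicate-suppression of revisited super-peers, for brevity).
        Set<Peer> search(String item, int ttl) {
            Set<Peer> hits = new HashSet<>(index.getOrDefault(item, Collections.emptySet()));
            if (ttl > 0)
                for (SuperPeer n : neighbours)
                    hits.addAll(n.search(item, ttl - 1));
            return hits;
        }
    }

Centralizing the index at each super-peer gives client lookups server-like efficiency, while the overlay of super-peers itself remains decentralized and hence robust.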
2.3 Data Integration Technologies

Data integration technologies have been studied extensively over the last decade, and a lot of work has been done. Traditionally, in the database community, data integration systems are characterized by an architecture based on a global schema and a set of sources, and a crucial aspect of these systems is modelling the relation between the sources and the global schema [20]. Two approaches have been proposed: one is global-as-view (GAV) [34], where the global schema is expressed in terms of the sources; the other is local-as-view (LAV) [17], where each source is defined as a view over the global schema. Regardless of the approach used, during query processing a query posed over the global schema needs to be reformulated in terms of a set of queries over the sources. A fundamental operation related to modelling is schema matching: a match operation is a function that takes two schemas as input and returns a match result, i.e., a set of mapping elements relating elements of one schema to elements of the other. [28] gives a survey of approaches to automatic schema matching.

However, schemas may carry semantics that affect the matching criteria but are not adequately captured or formally expressed. In such cases, two semantically related schemas may appear unrelated. One solution is to use ontologies. An ontology is "a formal, explicit specification of a shared conceptualization" [16]. In particular, ontologies are used to capture a shared understanding of a domain: its main concepts and their important relationships. By using ontology-based approaches for data annotation, semantic integration becomes easier to achieve. A single ontology for all is desirable but unrealistic; multiple ontologies may be developed either independently or based on a common upper ontology, and mappings between them need to be established to facilitate interoperability. [26] provides a brief survey of ontology-based approaches, and [19] reviews the state of the art in ontology mapping.
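To make the GAV/LAV distinction above concrete, consider a hypothetical global relation Protein(id, seq, struct) integrating two invented sources: S1(id, seq) holding sequence data (as in SWISS-PROT) and S2(id, struct) holding structure data (as in PDB). In Datalog-style notation, the two modelling approaches would read roughly as follows:

    % GAV: the global relation is defined as a view over the sources
    Protein(id, seq, struct) :- S1(id, seq), S2(id, struct)

    % LAV: each source is defined as a view over the global schema
    S1(id, seq)    :- Protein(id, seq, struct)
    S2(id, struct) :- Protein(id, seq, struct)

Under GAV, a query over Protein is answered simply by unfolding the view definition into queries over S1 and S2; under LAV, the query must be rewritten using the source views, which is computationally harder but makes adding or removing sources easier, since each source is described independently of the others.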
